Abstract

In information retrieval research, precision and recall have long been used to evaluate IR systems. However, given that 
a number of retrieval systems resembling one another are already available to the public, it is valuable to retrieve novel 
relevant documents, i.e., documents that cannot be retrieved by those existing systems. In view of this problem, we 
propose an evaluation method that favors systems retrieving as many novel documents as possible. We also used our 
method to evaluate systems that participated in the IREX workshop. 



1. Introduction 

In information retrieval (IR) research, the notions of precision and recall have commonly been used to evaluate the empirical performance of systems (Keen, 1992; Salton, 1992). Precision is the ratio of the number of relevant documents retrieved by a system under evaluation to the total number of documents retrieved by the system. Recall, on the other hand, is the ratio of the number of relevant documents retrieved by the system to the total number of relevant documents in a given benchmark test collection.
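The two definitions above can be made concrete with a minimal sketch. The function name and the document IDs are hypothetical and serve only as an illustration for a single query.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document IDs returned by the system.
    relevant:  set of document IDs judged relevant in the test collection.
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & set(relevant))
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents retrieved, three relevant ones exist.
p, r = precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"})
print(p, r)
```

Both measures count every relevant document equally, regardless of whether other systems already retrieve it.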

In other words, the precision/recall-based evalu- 
ation method regards all the relevant documents as 
equally important or informative for the user, and thus 
highly values systems that retrieve as many relevant 
documents as possible, with little noise. 

However, in the real world, where a number of IR systems are available (for example, on the World Wide Web), it is often the case that the user has already read some of the relevant documents using other systems. Thus, systems that always retrieve relevant documents similar to those retrieved by ubiquitous systems have little practical utility. In addition, meta search systems, which integrate document sets retrieved by more than one system, are less effective when the individual systems retrieve similar documents.

In view of these problems, our proposed IR evalu- 
ation method favors systems that retrieve more novel 
documents, that is, relevant documents which cannot 
be retrieved by other existing systems. 

From a different perspective, our evaluation method is also effective in producing test collections. The pooling method (Voorhees, 1998), which has commonly been used to produce test collections, requires a variety of participating systems. However, when most participating systems adopt similar techniques, it is not feasible to collect a sufficient "pool" (i.e., a set of candidates for relevant documents). Our evaluation method is expected to promote the development of IR systems based on various concepts, and therefore to resolve the above problem.

Section 2 formalizes the evaluation measure based on the novelty of documents, and Section 3 applies this measure to evaluate IR systems that participated in the IREX workshop (Sekine and Isahara, 1999).



2. Formalizing the Measure 

Instead of the notions of precision and recall, we propose as a new evaluation measure the utility of system x with respect to relevant document d, $U_d(x)$. This measure denotes the extent to which x contributes to providing the user with d, for a given query. Note that in this paper, d generally refers to a relevant document.

From an information-theoretic point of view, we calculate $U_d(x)$ as the ratio of the probability that the user reads document d by using system x, $P(D = d \mid S = x)$, to the probability that the user reads d by using another system (i.e., even without using x), $P(D = d)$, as shown in Equation (1).

$$U_d(x) = \log \frac{P(D = d \mid S = x)}{P(D = d)} \tag{1}$$

In the case where system x adopts a ubiquitous retrieval technique, the value of $P(D = d \mid S = x)$ becomes similar to that of $P(D = d)$, and thus the utility of x becomes small. On the other hand, the utility of x becomes greater as the number of novel relevant documents provided by x increases.

We then calculate the total utility of x, $U(x)$, by summing up the $U_d(x)$ values of all the relevant documents for the query, as shown in Equation (2).

$$U(x) = \sum_{d} U_d(x) \tag{2}$$



To sum up, our evaluation method favors systems with greater $U(x)$.

In Equation (1), $P(D = d)$ is the sum of $P(D = d \mid S = y)$ over the existing systems, weighted by the probability that the user utilizes system y, $P(S = y)$. Thus, given a set E of existing systems excluding x, we calculate $P(D = d)$ as in Equation (3).

$$P(D = d) = \sum_{y \in E} P(D = d \mid S = y) \cdot P(S = y) = \sum_{y \in E} P(D = d \mid S = y) \cdot \frac{1}{|E|} \tag{3}$$

Here, note that we assume uniformity with respect to $P(S = y)$.

Finally, the crucial point is how to estimate $P(D = d \mid S = x)$, i.e., the probability that the user reads document d by using system x. It can safely be assumed that the user always reads the top document, $d_1$, and thus $P(D = d_1 \mid S = x)$ is always 1. However, the probability that the user reads the remaining documents becomes smaller according to their ranking.

Given N documents sorted according to their relevance degree, in descending order, the user can choose a threshold for the ranking (i.e., the boundary until which he/she continues to read) out of N choices. Consequently, documents ranked lower than the threshold will be discarded.

In other words, we can calculate $P(D = d \mid S = x)$ as the probability that the user chooses a threshold equal to or greater than the ranking of d, as in Equation (4).

$$P(D = d \mid S = x) = \sum_{i = r_{x,d}}^{N} \frac{1}{N} = \frac{N - r_{x,d} + 1}{N} \tag{4}$$

Here, $r_{x,d}$ is the ranking of document d determined by system x.
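Equations (1) through (4) can be combined into a short sketch. The function names, the dictionary representation of a ranking, and the sample data are hypothetical. Two behaviours are assumptions rather than part of the formalization: relevant documents that x does not retrieve are skipped, and $P(D = d)$ is floored at a small constant so that the logarithm stays finite when no existing system retrieves d.

```python
import math


def read_probability(rank, n):
    """Equation (4): probability that the user reads the document at 1-based rank."""
    if rank is None or not 1 <= rank <= n:
        return 0.0  # assumption: documents outside the top N are never read
    return (n - rank + 1) / n


def prior_probability(doc, existing_runs, n):
    """Equation (3): read probability averaged uniformly over the systems in E."""
    total = sum(read_probability(run.get(doc), n) for run in existing_runs)
    return total / len(existing_runs)


def utility(x_run, existing_runs, relevant, n, floor=1e-6):
    """Equations (1) and (2): total utility U(x) of system x for one query."""
    total = 0.0
    for doc in relevant:
        p_x = read_probability(x_run.get(doc), n)
        if p_x == 0.0:
            continue  # assumption: documents x does not retrieve are skipped
        # Assumption: floor P(D = d) to avoid division by zero.
        p_d = max(prior_probability(doc, existing_runs, n), floor)
        total += math.log(p_x / p_d)
    return total


# Hypothetical rankings (document ID -> rank) with N = 4.
x = {"a": 1, "b": 2}
existing = [{"a": 1}, {"a": 2, "c": 1}]
print(utility(x, existing, {"a"}, 4))  # "a" is also retrieved by the others
print(utility(x, existing, {"b"}, 4))  # "b" is novel, so its utility is larger
```

How unretrieved documents and zero probabilities are handled noticeably affects the scores, so the floor value here should be read as a placeholder.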

3. A Case Study using the IREX Collection

Our concern in this section is to investigate the characteristics of our evaluation method. For this purpose, we targeted the IR systems that participated in the IREX workshop (Sekine and Isahara, 1999), and compared the result obtained with our newly proposed evaluation method with that based on precision/recall. We also investigated the reasons behind the differences between those two results, if any.

3.1. Overview of the IREX Collection

The IREX collection was produced through the IREX workshop (Sekine and Isahara, 1999), which consists of TREC-style IR and MUC-style named entity (NE) tasks for Japanese.^1 Hereafter, the IREX collection/workshop refers solely to that related to the IR task.

The IREX collection consists of 30 queries, 211,853 articles collected from two years' worth of "Mainichi Shimbun" newspaper articles (Mainichi Shimbun, 1994; 1995),^2 relevance assessment for each query, retrieval results of 22 participating systems, and technical details of each system.

Each query consists of the ID, description and narrative. While descriptions are usually phrases that briefly express the topic, narratives consist of several sentences and synonyms associated with the topic. Figure 1 shows an example query in the SGML form (translated into English by one of the organizers of the IREX workshop).

^1 http://cs.nyu.edu/cs/projects/proteus/irex/index-e.html



<TOPIC>
<TOPIC-ID>1001</TOPIC-ID>
<DESCRIPTION>Corporate merging</DESCRIPTION>
<NARRATIVE>The article describes a corporate merging and in the article, the name of companies have to be identifiable. Information including the field and the purpose of the merging have to be identifiable. Corporate merging includes corporate acquisition, corporate unifications and corporate buying.</NARRATIVE>
</TOPIC>

Figure 1: An example query in the IREX collection. 



Relevance assessment was performed based on the pooling method (Voorhees, 1998). That is, candidates for relevant documents were first pooled using the 22 participating systems. Thereafter, for each candidate document, human experts assigned one of three ranks of relevance, i.e., "relevant", "partially relevant" and "irrelevant". The average number of documents pooled for each query is 2,105, among which the numbers of relevant and partially relevant documents are 68 and 116, respectively.
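The pooling step can be sketched as a union over the top-ranked documents of every system. The pool depth, the function name and the sample runs are hypothetical; the text does not state how many documents per system were pooled.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents of every run (ranked lists of IDs)."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


# Hypothetical runs of three systems for one query.
runs = {
    "sysA": ["d1", "d2", "d3"],
    "sysB": ["d2", "d4", "d5"],
    "sysC": ["d1", "d2", "d6"],
}
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

When the systems rank similar documents at the top, the union stays small, which is the weakness of pooling noted in the introduction.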

Each retrieval result consists of the top 300 articles submitted in the same form as used in the TREC.^3 For each of the 22 results, the TREC evaluation software was used to investigate the performance (e.g., non-interpolated average precision). Figure 2 shows a fragment of the retrieval result obtained with one of the participating systems, which consists of the query ID, dummy field, article ID, ranking of the article, relevance degree computed by the system, and system ID.
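A minimal sketch of reading such a result line, assuming the six fields are separated by whitespace. The names below are hypothetical, and the dummy token "Q0" follows the usual TREC convention rather than anything confirmed for IREX; the remaining values are taken from the first row of Figure 2.

```python
from typing import NamedTuple


class ResultLine(NamedTuple):
    query_id: str
    article_id: str
    rank: int
    score: float
    system_id: str


def parse_result_line(line):
    """Parse one line holding the six whitespace-separated fields."""
    query_id, _dummy, article_id, rank, score, system_id = line.split()
    return ResultLine(query_id, article_id, int(rank), float(score), system_id)


def ranks_by_article(lines):
    """Map article ID -> rank for the result lines of a single query."""
    return {r.article_id: r.rank for r in map(parse_result_line, lines)}


# "Q0" is an assumed placeholder for the dummy field.
record = parse_result_line("1007 Q0 940228106 1 306856 1106")
print(record)
```

The mapping returned by `ranks_by_article` is the only input needed to evaluate the rank-based probability in Equation (4).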



1007   940228106   1   306856   1106
1007   940110130   2   246505   1106
1007   950106119   3   237173   1106
1007   940131126   4   236115   1106
1007   940614009   5   223313   1106
1007   940614002   6   222998   1106
1007   941107114   7   217324   1106
1007   940428222   8   215979   1106



^2 ... article IDs, which correspond to articles in the Mainichi Shimbun newspaper CD-ROM '94-'95. Participants must get a copy of the CD-ROMs themselves.



Figure 2: A fragment of the retrieval result of system 
"1106". 



It should be noted that, using the relevance assessment and the retrieval results for each system, we can easily calculate $P(D = d \mid S = x)$ in Equation (4), which is the central issue in estimating our evaluation measure.

^3 http://trec.nist.gov/pubs.html

Question                     Answers
query information used       only description (8), description+narrative (14)
indexing method              word (9), n-gram (3), word+character (2), character (1),
                             syntactic phrase (1), statistical phrase (1)
proper noun identification   yes (5)
query expansion              local feedback (2), use of a thesaurus (2)
retrieval method             vector space model (13), probabilistic model (4),
                             latent semantic indexing (1)

Table 1: A fragment of the result of the IREX questionnaire.

Technical details of participating systems were collected from questionnaires answered by each participant, where questions ranged from the retrieval algorithms used to execution time. Although several questions are relatively vague, a number of questions are useful for characterizing each system.

Table 1 shows representative questions in terms of retrieval accuracy. In this table, the number of answers is indicated in parentheses. However, answers classified as "no", "unknown" and "etc." are not shown. Roughly speaking, most systems adopted word-based indexing and the vector space model combined with TF-IDF term weighting.

On the other hand, note that in the IREX work- 
shop, the correspondence between system IDs and par- 
ticipants is not available to the public. Additionally, 
several participants did not have oral presentations and 
papers in the proceedings. Consequently, for some sys- 
tems it is difficult to obtain sufficient technical details. 

For example, although most participants answered "TF-IDF" for the question about the term weighting method, it is not possible to identify the exact formula used, out of a number of variants (Salton and Buckley, 1988; Zobel and Moffat, 1998), for several systems.



3.2. Experimentation 



As explained in Section 3.1, the 22 IREX participating systems have already been ranked based on conventional precision/recall, using the TREC evaluation software.
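For reference, non-interpolated average precision is commonly defined as the mean of the precision values observed at the rank of each relevant document, with unretrieved relevant documents contributing zero. The sketch below follows that standard definition; it is not guaranteed to match the TREC evaluation software in every detail, and the names and data are hypothetical.

```python
def average_precision(ranked_docs, relevant):
    """Non-interpolated average precision for one query."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant) if relevant else 0.0


# Hypothetical example: relevant documents found at ranks 1 and 3, one missed.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"})
print(ap)
```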

Thus, we re-evaluated the 22 systems based on our evaluation method, and compared the results derived from the two evaluation methods. More precisely, we conducted 22 trials, in each of which a different system was under evaluation and the rest were regarded as existing systems. That is, the former and the latter correspond to x and E in Section 2, respectively.

Note that in this evaluation, we did not regard 
"partially relevant" documents as relevant ones, be- 
cause interpretation of "partially relevant" is not fully 
clear to the authors. 
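The trial design described above can be sketched as a leave-one-out loop followed by a ranking of the resulting scores. The function names and the scores are hypothetical; only the system IDs come from the IREX collection.

```python
def leave_one_out(system_ids):
    """Yield (x, E): each system in turn, with all the others as existing systems."""
    for x in system_ids:
        yield x, [y for y in system_ids if y != x]


def rank_by_score(scores):
    """Map system ID -> rank (1 = highest score); ties keep insertion order."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {system: pos for pos, system in enumerate(ordered, start=1)}


systems = ["1103a", "1106", "1132"]  # three of the 22 IREX system IDs
trials = list(leave_one_out(systems))
print(trials)

# Hypothetical utility scores, one per trial.
print(rank_by_score({"1103a": 0.2, "1106": 0.9, "1132": 0.5}))
```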

Table 2 compares rankings obtained based on non-interpolated average precision and the utility factor we proposed in this paper. Table 3 compares rankings obtained with the two evaluation methods on a query-by-query basis, where we show solely the difference of rankings for enhanced readability. Since in the IREX collection every query ID consists of four digits starting with "10", we simply show the remaining two digits in Table 3.
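The ranking difference reported in the tables can be sketched as follows. The sign convention (average-precision rank minus utility rank, so that a positive value means a better rank under the utility measure) is inferred from the entries of Table 2; the function names are hypothetical.

```python
def rank_differences(precision_rank, utility_rank):
    """Positive value: the system ranks higher under the utility measure."""
    return {s: precision_rank[s] - utility_rank[s] for s in precision_rank}


def short_query_id(query_id):
    """Drop the common "10" prefix of a four-digit query ID."""
    if len(query_id) != 4 or not query_id.startswith("10"):
        raise ValueError(f"unexpected query ID: {query_id!r}")
    return query_id[2:]


# Rankings of three systems as listed in Table 2.
precision_rank = {"1144a": 1, "1106": 17, "1135b": 4}
utility_rank = {"1144a": 3, "1106": 6, "1135b": 4}
diff = rank_differences(precision_rank, utility_rank)
print(diff)
print(short_query_id("1007"))
```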



System ID   Avg. Precision   Utility   Difference
1144b              2             1         +1
1135a              3             2         +1
1144a              1             3         -2
1135b              4             4          0
1103b              5             5          0
1106              17             6        +11
1145b             16             7         +9
1122b              7             8         -1
1103a             10             9         +1
1128b              9            10         -1
1142               6            11         -5
1122a              8            12         -4
1110              11            13         -2
1133a             19            14         +5
1133b             18            15         +3
1128a             12            16         -4
1120              14            17         -3
1145a             13            18         -5
1112              15            19         -4
1146              20            20          0
1132              22            21         +1
1126              21            22         -1

Table 2: Comparison of rankings obtained based on non-interpolated average precision and utility factor.



3.3. Discussion 

Looking at Table 2, one may notice that the rankings of systems "1106", "1145b", "1133a" and "1133b" were significantly improved within our evaluation method. Thus, we investigated properties that characterize each of those four systems, in comparison with other systems.

           Query ID
System ID  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
1103a       8  -7  14   0   8   3   3 -14   1  13   5  -3   0  -4  -2   3  -6  -3   6   1  -2  13   2  14  -3  -5  -7  -2  -3   3
1103b      -2  -5   6   4  -1  -3  -6  -9   4  -5  -1   1  -3  -2  -1   8   0  -2   1  -2  -1   7   1  -3  -5  -1  -6  -3  -2   5
1106        8  -4  -9  -2   9  -2   7  11   5  -1  -2  -4   5   4   0  -3  -3   2   0   0  -1  -1   1   2   1   2   0   2  17   0
1110        6  -1  -4   4  -1   9  -4 -10  -1   0   4  -2  -5  -1   0   3   0  -2  -1   0   0  16  13  -1  -3  -3   8   1   3  -2
1112       -2  -5   0   0  -5   3  -3   1 -11   0   5  -5  12  -2  -1   5  -3  -4  -3  -1  -1  -4  -6  -4   3   1  -4  -2   0   0
1120        1  -2  -2  -1   0  -3   4  -8  -1   0   5  -2   7   1   0   5   0   2   0   2   0  -3  -1  -1   2   2   6   5  -1   0
1122a      -2   2  -2  -7  -5   5  -5 -11  -1  -5   1   8  -1  -6  -2  -8   1   1   0  -1   4  -4   1  -1  -3  -1   3  -2  -3  -1
1122b      -5   0  -8   1   0  -8   1  -5  -9  -5   0  -2  -3  -6   1  -4   4   0  -2   1   7  -3  -2  -4  -4   0   6   0  -1  -2
1126        0   4 -10   0   0  -2   0   3  -1  -1  -1   1  -1   0   0   0   0   0   0   1   1   0  -2  -3   0   0  -3  -1   0   0
1128a      -1  -1   4  -2  -3   0   3  -6  -8  -1  -3   4   2   9   1 -13   0   6   2  -1   0  -2   1   0  -1   1   4  -4   0   4
1128b      -2  14  -4  -4  -7  -5  11   9  -2  -2  -5   4  -1   3  -2 -13  -1   1   2   2   0   1   0  -5   1  -1   0  -4   0  -1
1132        0  16  -9   2   0   0   0  12  21   0   0  10   0   8  15   0  -4   0   0   0   0   0   2   0   0  -1   0  13   0   0
1133a      -2  -2  -4   0   3   2   3  15  11   1  -5  -1   1   7  -1   3   4   1   4   1   0  -2  -1   1   4   7  -1   0   0   1
1133b      -3  -2  -4   2   3   1  11  15   3   0  -4   2   0   5   1   6   5   0   3   1   0  -3  -5  -1  10   3  -2  -2   1  -1
1135a      -1  -2   9  -2   4 -11  -6   4   9   2  -6  -4  -1  -1  -1  -2  -3  -1  -1  -1   0  -2  -2   0   1  -1  -1   0  -1  -3
1135b       2   0   6  -1 -12 -13  -6   1   2   0  -3   1  -5  -6  -3  -1  -3  -2   0  -1  -4  -7  -2   0   0  -2  -1  -7  -2   0
1142       -4  -1  10   0  -5  -1  -7 -14  -7  -3  -2  -3  -4  -7  -5  -2   4  -3  -3  -1  -2  -2  -2  -5   2  -6  -7  -6  -1  -4
1144a      -2  -1  -1   3  -1   5 -16  -9  -3   5   1  -6  -1  -2   0   6  -1  -2  -2  -3   0   0  -2  -1   0  -4   7   2  -1  -1
1144b      -2   3  -1   2  -2   5 -16  -5  -2   5   2  -5   2  -2   1   5  -3   1   1  -1   0   0  -5  -2   0   1   4   2  -1   2
1145a       0  -4  -7  -4  -5  -1   5  11  -2  -1  -1  -3  -1  -1  -1   1   8  -3  -5   5  -1  -4   5   6  -2   2  -4  -3   1  -3
1145b       3  -3  -5   5  13   7  12  13  -5  -1  -2   8  -3   4   0   2   1   1  -2   0  -1   0   5   6  -2   7   0  13  -5   0
1146        0   1  21   0   7   9   9  -4  -3  -1  12   1   0  -1   0  -1   0   7   0  -2   1   0  -1   2  -1  -1  -2  -2  -1   3

Table 3: Query-by-query comparison of rankings obtained based on non-interpolated average precision and utility factor.

First, we found that "1106" adopted a relatively simple implementation, while most systems used more elaborate ones. More precisely, morphological analysis was performed, and nouns/verbs were extracted for word-based indexing. For term weighting, a TF-IDF formula as in Equation (5) was used, while most systems used different methods, such as the logarithmic TF formulation as in Equation (6) and one proposed by Robertson and Walker (1994).

$$f_{t,d} \cdot \log \frac{N}{n_t} \tag{5}$$

$$(1 + \log f_{t,d}) \cdot \log \frac{N}{n_t} \tag{6}$$



Here, $f_{t,d}$ denotes the frequency with which term t appears in document d, and $n_t$ denotes the number of documents containing term t. N is the total number of documents in the collection.
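The two term weighting formulas can be compared with a small sketch. The function and parameter names are hypothetical; the natural logarithm is assumed, since the text does not specify the base.

```python
import math


def tf_idf(f_td, n_t, n_docs):
    """Equation (5): raw term frequency times inverse document frequency."""
    return f_td * math.log(n_docs / n_t)


def log_tf_idf(f_td, n_t, n_docs):
    """Equation (6): logarithmic term frequency times inverse document frequency."""
    if f_td <= 0:
        return 0.0  # assumption: absent terms receive zero weight
    return (1 + math.log(f_td)) * math.log(n_docs / n_t)


# Hypothetical term occurring in 10 of 1,000 documents.
for f in (1, 3, 30):
    print(f, tf_idf(f, 10, 1000), log_tf_idf(f, 10, 1000))
```

The two weights coincide for a term that occurs once, and the logarithmic variant grows much more slowly as the term frequency increases.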

Second, "1145b" conducted query expansion (Qiu and Frei, 1993), whereas only a few other systems used query expansion (e.g., one based on a thesaurus). In addition, a term weighting method based on mutual information between two terms was introduced. A possible rationale behind this method is that two terms that frequently co-occur are effective in characterizing the domain of documents, and are thus assigned greater term weights.

Third, "1133a" and "1133b" also used domain knowledge for term weighting. However, unlike the case of "1145b", they regarded the pages of news articles as domains. In practice, a greater weight is assigned to terms whose distribution varies more strongly depending on the page, because they are expected to characterize the domain. On the other hand, terms that commonly appear in many pages are assigned a lesser weight.

To sum up, our novelty-based evaluation revealed 
the effectiveness of those properties above, specifi- 
cally term weighting methods introduced in "1145b", 
"1133a" and "1133b" , which were overshadowed or un- 
derestimated within the precision/recall-based evalua- 
tion. 

We devote a little space to considering Table 3 for further investigation. We arbitrarily regarded improvements above seven as significant, and focused solely on systems with relatively many significant improvements, that is, "1103a" and "1132". Although "1145b" is associated with the same number of significant improvements as "1132", we have already discussed system "1145b" above.

We found that "1103a" is one of the five systems that perform proper noun identification, and that five of the six queries where "1103a" achieved significant improvements are directly or indirectly associated with proper nouns.

Samples of query descriptions directly and indirectly related to proper nouns include "1016: Nick Price (a golfer)" and "1011: arrest of suspects of robbery in the Kanto region", respectively. Note that in the latter (indirect) case, Japanese prefectures within the Kanto region, which are not explicitly described in the query (e.g., Tokyo and Kanagawa), must be identified in news articles.

Finally, "1132" is the only system that used Latent Semantic Indexing (LSI), which is an extension of the vector space model, so as to retrieve relevant documents that share no terms with a given query. While "1132" had the lowest ranking in terms of average precision, as shown in Table 2, our evaluation method indicated that in many cases (queries) an LSI-based method is expected to retrieve relevant documents that other types of methods fail to retrieve.

4. Conclusion 

Evaluation methods based on precision and recall 
have long been used in information retrieval (IR) re- 
search, where systems that retrieve as many relevant 
documents as possible are usually highly valued. 

However, given the fact that a number of retrieval systems resembling one another are available to the public (not only in laboratories), it is valuable to retrieve relevant documents that can never be retrieved by those existing systems. This notion is also true in various contexts that require a variety of IR systems, such as meta search systems and the pooling method in producing IR test collections.

In consideration of these factors, we proposed a 
new evaluation method for IR, which favors systems 
that retrieve more novel documents, i.e., relevant doc- 
uments that many systems fail to retrieve. To realize 
this notion, we estimated the utility of a system in 
question by comparing the probability that the user 
reads relevant documents by using the system, and 
the probability that the user can read those documents 
even without using the system. 

We also applied our evaluation method to the 22 systems that participated in the IREX workshop, and identified several effective techniques that have been underestimated in the conventional precision/recall-based evaluation method.